Predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data).
The data is in raw form (not scaled) and contains binary columns for qualitative independent variables such as wilderness area and soil type. We are dealing with a classification problem with labeled data (labels 1 to 7, corresponding to the different cover types). Additionally, the training set includes ~500k samples, and the data is numeric (we are not dealing with text data, for which Naive Bayes could have been a favored choice). These initial thoughts lead us to consider ensemble classifiers (RandomForest, XGBoost, ExtraTrees). Our goal here will be to compare these classifiers and tune their parameters to reach an optimal solution.
Given the algorithms considered for implementation, the choice naturally turned to the scikit-learn library; indeed, ExtraTrees is not implemented in PySpark's ML libraries.
===================================================================================================================
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
from scipy.interpolate import BSpline
import numpy as np
import math as m
import random
import scipy.stats as stats
from scipy.stats import norm
import matplotlib.lines as mlines
import statistics as stat
from sklearn import ensemble
from pandas.plotting import scatter_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
from sklearn.ensemble import ExtraTreesClassifier
import statsmodels as sm
%matplotlib inline
plt.style.use('seaborn')
sns.set_style('darkgrid')
# Ignore message error arising on sns.pairplot
np.seterr(divide='ignore', invalid='ignore')
train_test_seed = 0
train = pd.read_csv("/home/theo/Dropbox/01. MASTER TELECOM PARIS/01. SD701 - Exploration de grands volumes de données/Final_Test/train-set.csv")
test = pd.read_csv("/home/theo/Dropbox/01. MASTER TELECOM PARIS/01. SD701 - Exploration de grands volumes de données/Final_Test/test-set.csv")
geological_zone = pd.read_csv("/home/theo/Dropbox/01. MASTER TELECOM PARIS/01. SD701 - Exploration de grands volumes de données/Final_Test/geological_zone.csv")
climatic_zone = pd.read_csv("/home/theo/Dropbox/01. MASTER TELECOM PARIS/01. SD701 - Exploration de grands volumes de données/Final_Test/climatic_zone.csv")
usfs_corr_tab = pd.read_csv("/home/theo/Dropbox/01. MASTER TELECOM PARIS/01. SD701 - Exploration de grands volumes de données/Final_Test/usfs_elu_code.csv", sep=";")
cover_type = train["Cover_Type"].unique()
cover_type.sort(axis=0)
cover_type = cover_type.tolist()
cover_colors = ["lightseagreen", "sandybrown", "lightgreen", "aquamarine", "darkgoldenrod", "lightsteelblue", "tomato"]
cover_labels = ["Spruce/Fir", "Lodgepole Pine", "Ponderosa Pine", "Cottonwood/Willow", "Aspen", "Douglas-fir", "Krummholz"]
df_cover_type = pd.DataFrame({"cover_indice" : cover_type, "cover_label" : cover_labels, "cover_color" : cover_colors})
train['Cover_Type_lab'] = train['Cover_Type'].map(df_cover_type.set_index('cover_indice')['cover_label'])
ID_train = train["Id"]
train.drop('Id', axis=1, inplace=True)
ID_test = test["Id"]
test.drop("Id", axis=1, inplace=True)
X = train.drop(["Cover_Type", "Cover_Type_lab"], axis=1)
Y = train["Cover_Type"]
We are dealing with a classification problem. Before (if relevant) proceeding with feature engineering, a first step is to analyse the most important variables and the correlations between them.
To do so, we will train a basic Random Forest classifier on the training data and analyse the most important features.
initial_rf_clf = RandomForestClassifier(n_estimators=200, max_depth=40, random_state=0, n_jobs=-1)
initial_rf_clf.fit(X, Y)
The following command extracts the importance of each feature once the Random Forest classifier has been trained on the data:
"Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples. The higher the value the more important the feature."
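Two consequences of this definition are easy to verify on a toy problem: the impurity-based importances returned by `feature_importances_` are normalized (they sum to 1), and uninformative features receive low scores. A minimal illustration on synthetic data (not the Covertype set), with the informative features deliberately placed first via `shuffle=False`:

```python
# Hedged illustration: impurity-based importances sum to 1, and
# noise features score lower than informative ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 5 features, only the first 2 informative (shuffle=False keeps them first).
X_toy, y_toy = make_classification(n_samples=300, n_features=5,
                                   n_informative=2, n_redundant=0,
                                   shuffle=False, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)
imp = clf.feature_importances_
print(np.round(imp, 3), imp.sum())
```

The same normalization holds for the classifier trained above, which is why the raw importance values can be compared directly across features.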
features_importance = initial_rf_clf.feature_importances_
# Removing Cover_Type indices and labels from the train columns to get the features labels
features_labels = train.columns[:-2]
df_features = pd.DataFrame({"feature" : features_labels, "importance" : features_importance})
df_features.sort_values("importance", ascending=True, inplace=True)
df_top10_features = df_features.tail(10)
y_pos = np.arange(df_top10_features.shape[0])
plt.barh(y_pos, df_top10_features["importance"], align='center', alpha=0.5)
plt.yticks(y_pos, df_top10_features["feature"])
plt.xlabel('Importance level')
plt.title('Top 10 features by importance (via Random Forest)')
plt.show()
===================================================================================================================
# Series.append is deprecated (removed in pandas 2.0); build a plain list instead
columns_to_select = df_top10_features["feature"].tolist() + ["Cover_Type_lab"]
train_top10 = train[columns_to_select]
# seaborn renamed the `size` parameter of pairplot to `height`
matrix = sns.pairplot(train_top10, height=5, hue="Cover_Type_lab", diag_kind="kde", diag_kws=dict(shade=True))
plt.tick_params(labelsize=10)
plt.show()